This is SNPfiltR v.1.0.2
Detailed usage information is available at: devonderaad.github.io/SNPfiltR/
If you use SNPfiltR in your published work, please cite the following papers:
DeRaad, D.A. (2022), SNPfiltR: an R package for interactive and reproducible SNP filtering. Molecular Ecology Resources, 22, 2443-2453. http://doi.org/10.1111/1755-0998.13618
Knaus, Brian J., and Niklaus J. Grünwald. 2017. vcfr: a package to manipulate and visualize variant call format data in R. Molecular Ecology Resources, 17.1:44-53. http://doi.org/10.1111/1755-0998.12549
library(StAMPP) #v1.6.3
Loading required package: pegas
Loading required package: ape
Registered S3 method overwritten by 'pegas':
method from
print.amova ade4
Attaching package: 'pegas'
The following object is masked from 'package:ape':
mst
The following object is masked from 'package:ade4':
amova
The following objects are masked from 'package:vcfR':
getINFO, write.vcf
We will follow the SNP filtering protocol outlined by the SNPfiltR R package in [@deraad2022].
0.2 Read in data
We will read in the unfiltered SNPs for all samples output as a vcf file after using BWA [@li2009] to map our raw RADseq data to the Thryothorus ludovicianus (Carolina wren) reference genome (available at this NCBI link; [@feng2020]).
#read in vcf
vcfR <- read.vcfR("~/Desktop/marsh.wren.phylogeography/populations.snps.vcf")
#how many samples and SNPs did we get from mapping all samples to the CAWR reference genome?
vcfR
***** Object of Class vcfR *****
201 samples
5726 CHROMs
331,667 variants
Object size: 1239.9 Mb
75.72 percent missing data
***** ***** *****
#read in sample info csv
sample.info <- read.csv("~/Desktop/marsh.wren.phylogeography/MAWR.phylogeography.full.sampling.csv")
#reorder sampling file to match order of samples in vcf
sample.info <- sample.info[match(colnames(vcfR@gt)[-1], sample.info$Sample),]
#make sure sampling file matches the order of samples in your vcf
sample.info$Sample == colnames(vcfR@gt)[-1]
1 Implement quality filters that don’t involve missing data
Removing low-data samples will alter percentage/quantile-based missing-data cutoffs, so we wait to implement those until after deciding on our final set of samples for downstream analysis.
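To see why the order of operations matters, here is a minimal sketch (with a hypothetical toy genotype matrix, not our data) showing that a fixed per-SNP completeness cutoff retains different SNPs before versus after a low-data sample is dropped:

```r
#toy genotype matrix: rows = SNPs, cols = samples; NA = missing call
gt <- matrix(c(0, 0, 1, NA,
               1, 0, 0, NA,
               0, 1, NA, NA), nrow = 3, byrow = TRUE)

#per-SNP completeness with all 4 samples (sample 4 is all missing)
comp.all <- rowMeans(!is.na(gt))

#per-SNP completeness after dropping the worst sample
comp.trim <- rowMeans(!is.na(gt[, -4]))

#an 80% completeness cutoff retains 0 SNPs before trimming, but 2 after
sum(comp.all >= 0.8)
sum(comp.trim >= 0.8)
```

This is why we finalize sample retention first, then set missing-data cutoffs.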
#hard filter to minimum depth of 3, and minimum genotype quality of 30
vcfR <- hard_filter(vcfR = vcfR, depth = 3, gq = 30)
21.28% of genotypes fall below a read depth of 3 and were converted to NA
6.22% of genotypes fall below a genotype quality of 30 and were converted to NA
Use the function filter_allele_balance() to filter for allele balance. From the Puritz SNP filtering tutorial: "Allele balance: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous, we expect that the allele balance in our data (for real loci) should be close to 0.5."
0.18% of het genotypes (0.01% of all genotypes) fall outside of 0.1 - 0.9 allele balance ratio and were converted to NA
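The underlying calculation is simple enough to sketch in base R. This toy example (hypothetical read counts, not SNPfiltR's internal code) computes allele balance at heterozygous calls and flags those outside the 0.1–0.9 window:

```r
#hypothetical ref/alt allele depths at four heterozygous genotype calls
ref.reads <- c(10, 12, 1, 9)
alt.reads <- c(11, 9, 20, 10)

#allele balance = ref reads / total reads; real het loci should be near 0.5
ab <- ref.reads / (ref.reads + alt.reads)
ab

#genotypes outside 0.1 - 0.9 are converted to NA (likely paralogs or errors)
bad <- ab < 0.1 | ab > 0.9
sum(bad)
```

Here only the third call (1 ref read vs. 20 alt reads, ab ≈ 0.048) would be masked.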
We now want to implement a max depth filter (super high depth loci are likely multiple loci stuck together into a single paralogous locus, which we want to remove).
#visualize and pick appropriate max depth cutoff
max_depth(vcfR)
cutoff is not specified, exploratory visualization will be generated.
dashed line indicates a mean depth across all SNPs of 29.4
#filter vcf by the max depth cutoff you chose
vcfR <- max_depth(vcfR, maxdepth = 250)
maxdepth cutoff is specified, filtered vcfR object will be returned
2.65% of SNPs were above a mean depth of 250 and were removed from the vcf
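The logic of this filter can be sketched in base R (toy depth values, not SNPfiltR's internals): mean depth is computed per SNP across samples, and SNPs whose mean depth exceeds the cutoff are dropped wholesale:

```r
#hypothetical DP matrix: rows = SNPs, cols = samples
dp <- matrix(c(20, 25, 30,
               400, 380, 410,
               15, 22, 18), nrow = 3, byrow = TRUE)

#per-SNP mean depth; inflated depth suggests collapsed paralogous loci
mean.dp <- rowMeans(dp)

#keep only SNPs at or below the chosen cutoff (250 in our case)
keep <- mean.dp <= 250
dp.filtered <- dp[keep, , drop = FALSE]
nrow(dp.filtered)
```

The second toy SNP (mean depth ≈ 397) would be removed as a likely paralog.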
Remove SNPs from the vcfR object that have become invariant following the removal of questionable genotypes above, and see how many SNPs remain after this initial set of filters.
vcfR<-min_mac(vcfR, min.mac =1)
27.48% of SNPs fell below a minor allele count of 1 and were removed from the VCF
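For intuition, the minor allele count (mac) can be computed from alt-allele dosage codes in a few lines of base R (toy genotypes, not SNPfiltR's internals); `min_mac(min.mac = 1)` simply drops rows where mac is 0:

```r
#toy genotypes coded as alt-allele copies (0/1/2); NA = missing call
gt <- matrix(c(0, 0, 0, 0,    #invariant after genotype masking: mac = 0
               0, 1, 2, 0,    #mac = 3
               0, 0, 1, NA),  #mac = 1
             nrow = 3, byrow = TRUE)

#mac = the rarer of the two allele counts among called genotypes
alt.copies <- rowSums(gt, na.rm = TRUE)
called <- rowSums(!is.na(gt))
mac <- pmin(alt.copies, 2 * called - alt.copies)

#min.mac = 1 retains only SNPs that are still variable
sum(mac >= 1)
```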
vcfR
***** Object of Class vcfR *****
201 samples
5330 CHROMs
234,156 variants
Object size: 882.3 Mb
80.5 percent missing data
***** ***** *****
2 Setting the missing data by sample threshold
#run function to visualize per sample missingness
miss <- missing_by_sample(vcfR)
No popmap provided
#run function to visualize per SNP missingness
by.snp <- missing_by_snp(vcfR)
cutoff is not specified, exploratory visualizations will be generated
Based on these visualizations, we will drop the worst-sequenced samples, which are dragging down the rest of the dataset. Drop those samples based on an approximate missing-data proportion cutoff here (this can always be revised later if it proves too lenient or too stringent):
#run function to drop samples above the threshold we want from the vcf
vcfR.trim <- missing_by_sample(vcfR = vcfR, cutoff = .9)
16 samples are above a 0.9 missing data cutoff, and were removed from VCF
#remove invariant sites generated by sample trimming
vcfR.trim <- min_mac(vcfR.trim, min.mac = 1)
0.6% of SNPs fell below a minor allele count of 1 and were removed from the VCF
3 Setting the missing data by SNP threshold
Now we will visualize different per SNP missing data thresholds and identify a value that optimizes the trade-off between amount of missing data and the total number of SNPs retained.
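This trade-off can be tabulated directly: for each candidate cutoff, count how many SNPs meet it. A minimal sketch with simulated per-SNP completeness values (hypothetical, standing in for what `missing_by_snp()` computes from the vcf):

```r
#simulate per-SNP completeness for 1000 hypothetical SNPs
set.seed(1)
snp.completeness <- runif(1000)

#SNPs retained at each candidate completeness cutoff
cutoffs <- c(0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 1)
retained <- sapply(cutoffs, function(x) sum(snp.completeness >= x))
data.frame(cutoff = cutoffs, snps.retained = retained)
```

Retention is necessarily non-increasing as the cutoff rises; the goal is to find the knee of that curve.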
#see what effect trimming samples had on missing data across the dataset
by.snp <- missing_by_snp(vcfR.trim)
cutoff is not specified, exploratory visualizations will be generated
Picking joint bandwidth of 0.051
vcf.85<-missing_by_snp(vcfR.trim, cutoff=.85)
cutoff is specified, filtered vcfR object will be returned
90.77% of SNPs fell below a completeness cutoff of 0.85 and were removed from the VCF
miss<-missing_by_sample(vcf.85)
No popmap provided
#no samples would have > 50% missing data at an 85% completeness cutoff, so I am happy with that
vcf.90 <- missing_by_snp(vcfR.trim, cutoff = .9)
cutoff is specified, filtered vcfR object will be returned
92.45% of SNPs fell below a completeness cutoff of 0.9 and were removed from the VCF
miss<-missing_by_sample(vcf.90)
No popmap provided
#no samples would have > 40% missing data at a 90% completeness cutoff
Investigate the effect of missing data
#I will now investigate and make sure sample clustering isn't being obviously driven by missing data
#make a splitstree for the 90% completeness dataset
#convert to genlight
gen <- vcfR2genlight(vcf.90)
#assign populations (a StAMPP requirement)
gen@pop <- as.factor(gen@ind.names)
gen@ind.names <- gsub("2511-03326_rep2", "R2_03326", gen@ind.names)
#generate pairwise divergence matrix
sample.div <- stamppNeisD(gen, pop = FALSE)
#export for splitstree
stamppPhylip(distance.mat = sample.div, file = "~/Desktop/marsh.wren.phylogeography/90.splits.txt")
#filter to 100% completeness and make a splitstree
vcf.100 <- missing_by_snp(vcfR.trim, cutoff = 1)
cutoff is specified, filtered vcfR object will be returned
99.37% of SNPs fell below a completeness cutoff of 1 and were removed from the VCF
#convert to genlight
gen <- vcfR2genlight(vcf.100)
#assign populations (a StAMPP requirement)
gen@pop <- as.factor(gen@ind.names)
gen@ind.names <- gsub("2511-03326_rep2", "R2_03326", gen@ind.names)
#generate pairwise divergence matrix
sample.div <- stamppNeisD(gen, pop = FALSE)
#export for splitstree
stamppPhylip(distance.mat = sample.div, file = "~/Desktop/marsh.wren.phylogeography/100.splits.txt")
90% complete splitstree (~18K SNPs)
100% complete splitstree (~1,500 SNPs)
We can see that missing data is not causing the intermediate clustering, specifically for the one obvious hybrid sample. Additionally, the samples from North Dakota and Saskatchewan are consistently pulled toward the middle, even when there is no missing data in the input SNP matrix, indicating that these samples carry real mismatched alleles, rather than excess missing data and insufficient signal to assign them.
4 Remove technical replicates
We can also see that for the one sample for which both replicates passed filtering requirements (2511-03326), the two technical replicates are nearly identical in our 90% complete splitstree.
In fact, we can count exactly how many genotypes were called identically between the two replicates, as an estimate of the error rate:
#calculate number of conflicting genotype calls between technical replicates
table(gsub(":.*", "", (vcf.90@gt[, colnames(vcf.90@gt) == "2511-03326_rep1"])) ==
        gsub(":.*", "", (vcf.90@gt[, colnames(vcf.90@gt) == "2511-03326_rep2"])))
FALSE TRUE
3 17438
#isolate the exact genotypes that conflict
z <- gsub(":.*", "", (vcf.90@gt[, colnames(vcf.90@gt) == "2511-03326_rep1" | colnames(vcf.90@gt) == "2511-03326_rep2"]))[gsub(":.*", "", (vcf.90@gt[, colnames(vcf.90@gt) == "2511-03326_rep1"])) != gsub(":.*", "", (vcf.90@gt[, colnames(vcf.90@gt) == "2511-03326_rep2"])), ]
#print them with missing values removed
z[complete.cases(z), ]
Only 3 genotypes differ between these two replicates across > 17K called SNPs, suggesting a very low (~0.017%) rate of non-biological variability in genotype calls in our final dataset. Looking at the conflicting genotypes, the variation occurs exclusively where the caller disagreed between a homozygous and a heterozygous call, exactly the calls whose confidence varies with read depth and the random sampling of the two alleles. This is overall a very encouraging sign for the quality of this dataset.
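The discordance estimate above comes directly from the table of matching versus conflicting calls:

```r
#counts from the table of genotype comparisons above
conflicting <- 3
matching <- 17438

#proportion of jointly called genotypes that disagree between replicates
error.rate <- conflicting / (conflicting + matching)
round(100 * error.rate, 3)  #~0.017%
```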
We will now remove the technical replicate with the most missing data.
vcf.90 <- vcf.90[, colnames(vcf.90@gt) != "2511-03326_rep2"]
#remove any potentially invariant SNPs
vcf.90 <- min_mac(vcf.90, min.mac = 1)
0.02% of SNPs fell below a minor allele count of 1 and were removed from the VCF
5 Remove overlapping SNPs
It is a known issue that Stacks will not merge SNPs called twice when they are sequenced by separate (but physically overlapping) loci. To account for this, we will simply remove a SNP whenever its chromosome and position have already been seen in the dataset, with the following code:
#generate dataframe containing information for chromosome and bp locality of each SNP
df <- as.data.frame(vcf.90@fix[, 1:2])
#calc number of duplicated SNPs to remove
nrow(df) - length(unique(paste(df$CHROM, df$POS)))
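The removal itself reduces to `duplicated()` on the pasted chromosome/position strings. A self-contained sketch with toy positions (hypothetical, not our vcf):

```r
#toy SNP table: two SNPs share chr1:100, the residue of overlapping Stacks loci
df <- data.frame(CHROM = c("chr1", "chr1", "chr1", "chr2"),
                 POS   = c(100, 100, 250, 100))

#flag every SNP whose chromosome/position has already been seen
dup <- duplicated(paste(df$CHROM, df$POS))
sum(dup)            #number of duplicated SNPs to remove

#keep only the first SNP observed at each position
df.dedup <- df[!dup, ]
nrow(df.dedup)
```

In the real workflow, the same logical vector (`!dup`) would subset the rows of both `vcf.90@fix` and `vcf.90@gt` so the two slots stay in register.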